Abstract. AdaBoost [4] is a well-known ensemble learning algorithm
that constructs its constituent or base models in sequence. A key step
in AdaBoost is constructing a distribution over the training examples
to create each base model. This distribution, represented as a vector, is
constructed to be orthogonal to the vector of mistakes made by the
previous base model in the sequence [6]. The idea is to make the next base
model’s errors uncorrelated with those of the previous model. In previous
work [8], we developed an algorithm, AveBoost, that constructed
distributions orthogonal to the mistake vectors of all the previous models, and
then averaged them to create the next base model’s distribution. Our
experiments demonstrated the superior accuracy of our approach. In this
paper, we slightly revise our algorithm to allow us to obtain non-trivial
theoretical results: bounds on the training error and generalization error
(difference between training and test error). Our averaging process has a
regularizing effect which, as expected, leads us to a worse training error
bound for our algorithm than for AdaBoost but a superior
generalization error bound. For this paper, we experimented with the data that
we used in [8] both as originally supplied and with added label noise (a
small fraction of the data has its original label changed). Noisy data are
notoriously difficult for AdaBoost to learn. Our algorithm’s performance
improvement over AdaBoost is even greater on the noisy data than the
original data.


1 Introduction 

AdaBoost [4] is one of the most well-known and highest-performing ensemble 
classifier learning algorithms [3]. It constructs a sequence of base models, where 
each model is constructed based on the performance of the previous model on 
the training set. In particular, AdaBoost calls the base model learning algorithm 
with a training set weighted by a distribution.¹ After the base model is created,
it is tested on the training set to see how well it learned. We assume that the
base model learning algorithm is a weak learning algorithm [5]; that is, with
high probability, it produces a model whose probability of misclassifying an
example is less than 0.5 when that example is drawn from the same distribution
that generated the training set. The point is that such a model performs better
than random guessing.² The weights of the correctly classified examples and
misclassified examples are scaled down and up, respectively, so that the two
groups’ total weights are 0.5 each. The next base model is generated by calling
the learning algorithm with this new weight distribution and the training set.
The idea is that, because of the weak learning assumption, at least some of
the previously misclassified examples will be correctly classified by the new base
model. Previously misclassified examples are more likely to be classified correctly
because of their higher weights, which focus more attention on them. Kivinen
and Warmuth [6] have shown that AdaBoost scales the distribution with the
goal of making the next base model’s mistakes uncorrelated with those of the
previous base model.

¹ If the base model learning algorithm cannot take a weighted training set as input,
then one can create a sample with replacement from the original training set
according to the distribution and call the algorithm with that sample.
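The resampling fallback for learners that cannot accept example weights can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `resample_by_weight` and the toy data are hypothetical, and the sample size is taken to equal the training set size, which the text does not specify.

```python
import random

def resample_by_weight(examples, weights, rng=None):
    """Draw len(examples) items with replacement; item i is drawn
    with probability proportional to weights[i]."""
    rng = rng or random.Random()
    return rng.choices(examples, weights=weights, k=len(examples))

# Hypothetical toy training set and a distribution over its four examples.
training_set = [("a", 0), ("b", 1), ("c", 0), ("d", 1)]
distribution = [0.1, 0.1, 0.1, 0.7]  # example "d" carries most of the weight

sample = resample_by_weight(training_set, distribution, random.Random(0))

# An example with zero weight is never drawn.
only_d = resample_by_weight(training_set, [0.0, 0.0, 0.0, 1.0], random.Random(1))
```

Heavily weighted examples tend to appear several times in the sample, which is how an unweighted learner is made to pay more attention to them.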

AdaBoost is notorious for performing poorly on noisy datasets [3], such as 
those having some examples that were assigned the wrong class label. Because 
these examples are inconsistent with the majority of examples, they tend to 
be harder for the base model learning algorithm to learn. AdaBoost increases 
the weights of examples that the base model learning algorithm did not learn 
correctly. Noisy examples are likely to be incorrectly learned by many of the 
base models so that eventually these examples’ weights will dominate those of 
the remaining examples. This causes AdaBoost to focus too much on the noisy 
examples at the expense of the majority of the training examples, leading to 
poor performance on new examples. 

We previously [8] presented an algorithm, called AveBoost, which calculates 
the next base model’s distribution by first calculating a distribution the same 
way as in AdaBoost, but then averaging it elementwise with those calculated for 
the previous base models. This averaging mitigates AdaBoost’s tendency to
increase the weights of noisy examples to excess. In our previous work we presented
promising experimental results. However, we did not present theoretical results.
We were unable to derive a non-trivial training error bound for the algorithm
presented in [8]. In this paper, we present a slight modification to AveBoost
which allows us to obtain both a non-trivial training error bound and a
generalization error bound (difference between training error and test error). We
call this algorithm AveBoost2. In Section 2, we review AdaBoost. In Section 3,
we describe the AveBoost2 algorithm and state our modification to what we
presented previously. In Section 4, we present our training error bound and
generalization error bound. The averaging in our algorithm has a regularizing effect;
therefore, as expected, our training error bound is worse than that of AdaBoost
but our generalization error bound is better than AdaBoost’s. In Section 5, we
present an experimental comparison of our new AveBoost2 with AdaBoost on
some UCI datasets [1] both in original form and with 10% label noise added.
Section 6 summarizes this paper and describes ongoing and future work.

² The version of AdaBoost that we use was designed for two-class classification
problems. However, it is often used for a larger number of classes when the base model
learning algorithm is strong enough to have an error less than 0.5 in spite of the
larger number of classes.


AdaBoost($\{(x_1, y_1), \ldots, (x_m, y_m)\}$, $L_b$, $T$)

  Initialize $d_{1,i} = 1/m$ for all $i \in \{1, 2, \ldots, m\}$.

  For $t = 1, 2, \ldots, T$:

    $h_t = L_b(\{(x_1, y_1), \ldots, (x_m, y_m)\}, d_t)$.

    Calculate the error of $h_t$: $\epsilon_t = \sum_{i : h_t(x_i) \neq y_i} d_{t,i}$.

    If $\epsilon_t > 1/2$ then,
      set $T = t - 1$ and abort this loop.

    $\beta_t = \frac{\epsilon_t}{1 - \epsilon_t}$.

    Calculate distribution $d_{t+1}$:

      $w_i = d_{t,i} \times \begin{cases} \beta_t & \text{if } h_t(x_i) = y_i \\ 1 & \text{otherwise.} \end{cases}$

      $d_{t+1,i} = \frac{w_i}{\sum_{i=1}^{m} w_i}$

  Output the final hypothesis:

    $h_{fin}(x) = \arg\max_{y \in Y} \sum_{t : h_t(x) = y} \log \frac{1}{\beta_t}$.


Fig. 1. AdaBoost algorithm: $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ is the training set, $L_b$ is the base
model learning algorithm, and $T$ is the maximum allowed number of base models.


2 AdaBoost 

Figure 1 shows AdaBoost’s pseudocode. AdaBoost constructs a sequence of base
models $h_t$ for $t \in \{1, 2, \ldots, T\}$, where each model is constructed based on the
performance of the previous base model on the training set. In particular,
AdaBoost maintains a distribution over the $m$ training examples. The distribution
$d_1$ used in creating the first base model gives equal weight to each example
($d_{1,i} = 1/m$ for all $i \in \{1, 2, \ldots, m\}$). AdaBoost now enters the loop, where the
base model learning algorithm $L_b$ is called with the training set and $d_1$.³ The
returned model is then tested on the training set to see how well it learned.
The total weight of the misclassified examples ($\epsilon_1$) is calculated. The weights
of the correctly classified examples are multiplied by $\epsilon_1/(1 - \epsilon_1)$ so that they
have the same total weight as the misclassified examples. The weights are then
normalized so that they sum to 1 instead of $2\epsilon_1$. AdaBoost assumes that $L_b$ is
a weak learner, i.e., $\epsilon_t < 1/2$ with high probability. Under this assumption, the
total weight of the misclassified examples $\epsilon_t < 1/2$ is increased to 1/2 and the
total weight of the correctly classified examples $1 - \epsilon_t > 1/2$ is decreased to
1/2. This is done so that, by the weak learning assumption, $h_{t+1}$ will classify at
least some of the previously misclassified examples correctly. Returning to the
algorithm, the loop continues, creating the $T$ base models in the ensemble. The
final ensemble returns, for a new example, the one class in the set of classes $Y$
that gets the highest weighted vote from the base models.

³ As mentioned earlier, if $L_b$ cannot take a weighted training set as input, then we can
give it a sample drawn with replacement from the original training set according to
the distribution $d$ induced by the weights.


Fig. 2. AveBoost2 algorithm: $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ is the training set, $L_b$ is the base
model learning algorithm, and $T$ is the maximum allowed number of base models.
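The reweighting step described in Section 2 can be checked numerically with a small Python sketch. The function name `adaboost_update` and the five-example toy distribution are assumptions made for illustration; the early-exit test on the error is left to the caller, and the sketch only guards against the degenerate errors 0 and 1.

```python
def adaboost_update(d, correct):
    """One AdaBoost distribution update.

    d       -- current weights d_t, summing to 1
    correct -- correct[i] is True if h_t classified example i correctly
    Returns (d_next, eps), where eps is the weighted error of h_t.
    """
    eps = sum(w for w, ok in zip(d, correct) if not ok)
    if not 0.0 < eps < 1.0:
        raise ValueError("update is undefined for eps = 0 or eps = 1")
    beta = eps / (1.0 - eps)
    # Correctly classified examples are scaled by beta; the others are unchanged.
    scaled = [w * beta if ok else w for w, ok in zip(d, correct)]
    total = sum(scaled)  # equals 2 * eps before normalization
    return [w / total for w in scaled], eps

# Hypothetical round: uniform weights, only the last example is misclassified.
d1 = [0.2] * 5
d2, eps1 = adaboost_update(d1, [True, True, True, True, False])
```

After the update, the single misclassified example holds half of the total weight and the four correctly classified examples share the other half, as the text describes.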

3 AveBoost2 algorithm 

Figure 2 shows our new algorithm, AveBoost2. Just as in AdaBoost, AveBoost2
initializes $d_{1,i} = 1/m$ for all $i \in \{1, 2, \ldots, m\}$. Then it goes inside the loop,
where it calls the base model learning algorithm $L_b$ with the training set and
distribution $d_1$ and calculates the error of the resulting base model $h_1$. It then
calculates $c_1$, which is the distribution that AdaBoost would use to construct the
next base model. However, AveBoost2 averages this with $d_1$ to get $d_2$, and uses
this $d_2$ instead. Showing that the $d_t$’s in AveBoost2 are distributions is a trivial
proof by induction. For the base case, $d_1$ is constructed to be a distribution. For
the inductive part, if $d_t$ is a distribution, then $d_{t+1}$ is a distribution because it is
a convex combination of $d_t$ and $c_t$, both of which are distributions. The vector
$d_{t+1}$ is a running average of $d_1$ and the vectors $c_q$ for $q \in \{1, 2, \ldots, t\}$.
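A minimal Python sketch of this distribution update follows. It assumes the running average takes the incremental form $d_{t+1} = (t\,d_t + c_t)/(t+1)$, which is what averaging $d_1$ with $c_1, \ldots, c_t$ implies; the function names are hypothetical and the model voting weights are deliberately left out.

```python
def adaboost_distribution(d, correct):
    """The distribution c_t that AdaBoost would use for the next base model."""
    eps = sum(w for w, ok in zip(d, correct) if not ok)
    beta = eps / (1.0 - eps)
    scaled = [w * beta if ok else w for w, ok in zip(d, correct)]
    total = sum(scaled)
    return [w / total for w in scaled]

def aveboost2_update(d_t, correct, t):
    """Running-average update (assumed form): d_{t+1} = (t * d_t + c_t) / (t + 1).

    t is the 1-based index of the base model that was just trained.
    """
    c_t = adaboost_distribution(d_t, correct)
    return [(t * d + c) / (t + 1) for d, c in zip(d_t, c_t)]

# Hypothetical first round: uniform weights, only the last example is misclassified.
d1 = [0.2] * 5
d2 = aveboost2_update(d1, [True, True, True, True, False], t=1)
```

In this toy round AdaBoost alone would move the misclassified example to weight 0.5, while the averaged update moves it only to 0.35, and the result still sums to 1 because it is a convex combination of two distributions.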

Returning to the algorithm, the loop continues for a total of $T$ iterations.
Then the base models are combined using a weighted voting scheme slightly
different from that of AdaBoost and the original AveBoost from [8]: each model’s
weight is $\log(1/(\beta_t \gamma_t))$ instead of $\log(1/\beta_t)$. AveBoost2 is actually AdaBoost
with $\beta_t$ replaced by $\beta_t \gamma_t$. However, we wrote the AveBoost2 pseudocode as we
did to make the running average calculation of the distribution explicit.

AveBoost2 can be seen as a relaxed version of AdaBoost. When training
examples are noisy and therefore difficult to fit, AdaBoost is known to increase the
weights on those examples to excess and overfit them [3] because many
consecutive base models may not learn them properly. AveBoost2’s averaging does not
allow the weights of noisy examples to increase rapidly, thereby mitigating the
overfitting problem. We therefore expect AveBoost2 to outperform AdaBoost on
the noisy datasets to a greater extent than on the original datasets.
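The damping effect of the averaging can be illustrated with a small simulation. The scenario is entirely hypothetical: example 0 stands in for a noisy example that every base model misclassifies, and in round $t$ one fresh example (index $t$) is misclassified as well. No base learner is trained, and the running-average update is assumed to be $d_{t+1} = (t\,d_t + c_t)/(t+1)$.

```python
def rescale_to_half(d, missed):
    """AdaBoost's target distribution: misclassified and correctly
    classified examples each end up with total weight 1/2."""
    eps = sum(d[i] for i in missed)
    return [w / (2 * eps) if i in missed else w / (2 * (1 - eps))
            for i, w in enumerate(d)]

def simulate(m=20, rounds=10):
    """Track the weight of example 0 under both update rules.

    Requires rounds <= m - 1 so that each round has a fresh example t.
    """
    ada = [1.0 / m] * m
    ave = [1.0 / m] * m
    for t in range(1, rounds + 1):
        missed = {0, t}  # the noisy example plus one fresh example
        ada = rescale_to_half(ada, missed)
        c = rescale_to_half(ave, missed)
        ave = [(t * a + ci) / (t + 1) for a, ci in zip(ave, c)]
    return ada[0], ave[0]

ada_weight, ave_weight = simulate()
```

In this setting the weight of the persistently misclassified example climbs toward 1/2 under the AdaBoost update, while under the averaged update it grows more slowly because each new target distribution is only one term of the average.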


4 Theory 


In this section, we give bounds on the training error and generalization error
(difference between training and test error). Not surprisingly, the relaxed nature
of AveBoost2 relative to AdaBoost caused us to obtain a worse training error
bound but superior generalization error bound for AveBoost2 relative to
AdaBoost. Due to space limitations, we defer the proofs and more intuition on the
theoretical frameworks that we use to a longer version of this paper.

Theorem 1. In AveBoost2, suppose the weak learning algorithm $L_b$ generates
hypotheses with errors $\epsilon_1, \epsilon_2, \ldots, \epsilon_T$ where each $\epsilon_t < 1/2$. Then the ensemble’s
error $\epsilon$, summed over the training examples $i$ with $h_{fin}(x_i) \neq y_i$, is bounded as follows:


e 


sn 


t + 1 


A 2 + 2MT=77T^ 


1 

4c t (l-c t ) 


This bound is clearly non-trivial ($\epsilon < 1$), but it is greater than that of
AdaBoost [4]:

$$\epsilon \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}.$$
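To get a feel for the AdaBoost bound, the following Python sketch evaluates $2^T \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}$ for a list of base model errors. The function name and the error values are hypothetical, and the sketch covers only the AdaBoost bound.

```python
import math

def adaboost_training_error_bound(errors):
    """Evaluate 2^T * prod_t sqrt(eps_t * (1 - eps_t)) for base model errors eps_t."""
    return (2 ** len(errors)) * math.prod(math.sqrt(e * (1.0 - e)) for e in errors)

# Hypothetical error sequences: every base model has weighted error 0.3.
bound_10 = adaboost_training_error_bound([0.3] * 10)
bound_20 = adaboost_training_error_bound([0.3] * 20)

# With every error equal to 1/2 the bound is vacuous (exactly 1).
bound_vacuous = adaboost_training_error_bound([0.5] * 3)
```

Each round contributes a factor $2\sqrt{\epsilon_t(1-\epsilon_t)}$, which is below 1 whenever $\epsilon_t \neq 1/2$, so the bound shrinks geometrically as base models are added.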

To derive our generalization error bound, we use the algorithmic stability
framework of [7]. Intuitively, algorithmic stability is very similar to Breiman’s
notion of stability [2]: the more stable a learning algorithm is, the less of an effect
changes to the training set have on the model returned. Clearly, the more stable
the learning algorithm is, the smaller the difference between the training and
test errors tends to be. We show that AveBoost2 is more stable than AdaBoost;
therefore, the difference between the training and test errors is lower. We first
give some preliminaries from [7] and then state our new result.

For the following, $X$ is the space of possible inputs, $Y = \{0, 1\}$ is the set of
possible labels, and $Z = X \times Y$.

Definition 1 (Definition 2.5 from [7]). A learning algorithm is a process
which takes as input a distribution $p$ on $Z$ with finite support and outputs a
function $f_p : X \to [0, 1]$. For $S \in Z^m$ for some positive integer $m$, $f_S$ means $f_p$
where $p$ is the uniform distribution on $S$.

In the following, the error of $f$ on an example $(x, y)$ is $c(f, (x, y)) = |f(x) - y|$.

Definition 2 (Definition 2.11 from [7]). A learning algorithm has $L_1$-stability
$\lambda$ if, for any two distributions $p$ and $q$ on $X$ with finite support,

$$\forall z \in Z, \quad |c(f_p, z) - c(f_q, z)| \le \lambda \|p - q\|_1 .$$

In the following, $D$ is a distribution on $Z$, $S \sim D^m$ is a set of $m$ examples
drawn from $Z$ according to $D$, and $S^{i,u}$ is $S$ with example $i \in \{1, 2, \ldots, m\}$
removed (each $i$ is chosen with probability $1/m$) and example $u \sim D$ added.

Definition 3 (Definition 2.14 from [7]). A learning algorithm is $(\beta, \delta)$-stable
if

$$P_{S \sim D^m}\left(|c(f_S, z) - c(f_{S^{i,u}}, z)| \le \beta\right) \ge 1 - \delta .$$

Intuitively, $f_S$ and $f_{S^{i,u}}$ are models that result from running the learning
algorithm on two slightly different training sets. As $\beta$ and $\delta$ decrease, the
probability of having smaller differences in errors between these two models increases,
which means that the learning algorithm is more stable. Greater stability implies
lower generalization error according to the following theorem. In the following,
$Err_S(f_S)$ is the training error (error on the training set $S$) and $Err_D(f_S)$ is the
test error, i.e., the error on an example $(x, y)$ chosen at random according to
distribution $D$.
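The quantity inside Definition 3 can be made concrete with a toy experiment in Python. Everything here is an illustrative assumption: the learner simply predicts the mean training label (so its output lies in $[0, 1]$), the data are synthetic, and the names `learn_mean_label` and `replace_one` are hypothetical.

```python
import random

def error(f, example):
    """c(f, (x, y)) = |f(x) - y|."""
    x, y = example
    return abs(f(x) - y)

def learn_mean_label(S):
    """Toy learner: ignores x and predicts the mean label of S."""
    mean = sum(y for _, y in S) / len(S)
    return lambda x: mean

def replace_one(S, i, u):
    """S^{i,u}: S with example i removed and example u added."""
    return S[:i] + [u] + S[i + 1:]

rng = random.Random(0)

def draw():
    # Synthetic stand-in for sampling from D: random input, random binary label.
    return (rng.random(), rng.randint(0, 1))

m = 50
S = [draw() for _ in range(m)]
z = draw()

diffs = []
for _ in range(200):
    i, u = rng.randrange(m), draw()
    f_S, f_Siu = learn_mean_label(S), learn_mean_label(replace_one(S, i, u))
    diffs.append(abs(error(f_S, z) - error(f_Siu, z)))

beta_hat = max(diffs)  # largest observed change in error
```

For this learner, replacing one example moves the mean label by at most $1/m$, so the observed change never exceeds $1/m$. A less stable learner would show larger changes for the same experiment.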

Theorem 2 (Theorem 3.4 from [7]). Suppose a $(\beta, \delta)$-stable learning
algorithm returns a hypothesis $f_S$ for any training set $S \sim D^m$. Then for all $\tau > 0$
and $m \ge 1$,

( — r 2 m \ 4m 2 5 

Ps~D~(\Err s (fs) - Err D (fs)\ > r + 0 + 6) < 2exp 

Intuitively, this theorem shows that lower values of $\beta$ and $\delta$ lead to lower
probabilities of large differences between the training and test errors. We can
finally state our theorem on AveBoost2’s stability.

Theorem 3. Suppose the base model learning algorithm has $L_1$-stability $\lambda$ and
$\min_{t \in \{1, 2, \ldots, T\}} \epsilon_t > \epsilon_* > 0$. Then, for sufficiently large $m$ and for all $T$,
AveBoost2 is $(\beta, \delta)$-stable, where


Table 1. The datasets used in the experiments

Data Set          Training Set   Test Set   Inputs   Classes
Promoters                   84         22       57         2
Balance                    500        125        4         3
Breast Cancer              559        140        9         2
German Credit              800        200       20         2
Car Evaluation            1382        346        6         4
Chess                     2556        640       36         2
Mushroom                  6499       1625       22         2
Nursery                  10368       2592        8         5
Connect4                 54045      13512       42         3


Table 2. Performance of AveBoost2 compared to AdaBoost

                   ORIGINAL                      10% NOISE
                   Num. Base Models              Num. Base Models
Base Model         10       50       100         10       50       100
Naive Bayes        +4=4-1   +4=4-1   +4=4-1      +8=1-0   +8=1-0   +7=2-0
Decision Trees     +2=6-1   +1=6-2   +2=6-1      +6=2-1   +5=3-1   +6=2-1
Decision Stumps    +1=6-2   +1=6-2   +1=6-2      +0=7-2   +1=7-1   +1=7-1


(a + i)^ y 


For AdaBoost, the theorem is the same except that [7] 

2 ^ 2 t2 + 1 (A + l) t 

p m k ’ 

which is clearly larger than the corresponding $\beta$ for AveBoost2. This means
that AveBoost2’s generalization error is less than that of AdaBoost by
Theorem 2.



5 Experimental Results 

In this section, we compare AdaBoost and AveBoost2 on the nine UCI datasets [1]
described in Table 1. We ran both algorithms with three different values of $T$,
which is the maximum number of base models that the algorithm is allowed to
construct: 10, 50, and 100. Each result reported is the average over 50 results
obtained by performing 10 runs of 5-fold cross-validation. Table 1 shows the
sizes of the training and test sets for the cross-validation runs. We also repeated
these runs after adding 10% label noise. That is, we randomly chose 10% of the
examples in each dataset and changed their labels to one of the remaining labels
with equal probability.

Fig. 3. Test set error rates of AdaBoost vs. AveBoost2 (Naive Bayes, original datasets)

Fig. 4. Test set error rates of AdaBoost vs. AveBoost2 (Naive Bayes, noisy datasets)
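The label-noise procedure described above can be sketched as follows. The function name `add_label_noise`, the rounding of the number of changed examples, and the synthetic three-class labels are assumptions for illustration, not details taken from the experiments.

```python
import random

def add_label_noise(labels, classes, fraction=0.10, rng=None):
    """Change the labels of a random `fraction` of the examples.

    Each chosen example receives one of the remaining labels,
    picked with equal probability.
    """
    rng = rng or random.Random()
    noisy = list(labels)
    n_changed = round(fraction * len(labels))
    for i in rng.sample(range(len(labels)), n_changed):
        others = [c for c in classes if c != noisy[i]]
        noisy[i] = rng.choice(others)
    return noisy

# Hypothetical three-class label vector with 300 examples.
clean = [0, 1, 2] * 100
noisy = add_label_noise(clean, classes=[0, 1, 2], fraction=0.10, rng=random.Random(0))
n_different = sum(a != b for a, b in zip(clean, noisy))
```

Because the replacement label is always drawn from the other classes, every chosen example really does change its label, so exactly the requested fraction of the labels differs from the clean data.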

Table 2 shows how often AveBoost2 significantly outperformed, performed
comparably with, and significantly underperformed AdaBoost. For example, on
the original datasets and with 10 Naive Bayes base models, AveBoost2
significantly outperformed⁴ AdaBoost on four datasets, performed comparably on four
datasets, and performed significantly worse on one, which is written as
“+4=4-1.” Figures 3 and 4 compare the error rates of AdaBoost and AveBoost2 with
Naive Bayes base models on the original and noisy datasets, respectively. In all
the plots presented in this paper, each point marks the error rates of two
algorithms when run with the number of base models indicated in the legend and a
particular dataset. The diagonal lines in the plots contain points at which the two
algorithms have equal error. Therefore, points below/above the line correspond
to the error of the algorithm indicated on the y-axis being less than/greater than
the error of the algorithm indicated on the x-axis, respectively. We can see that,
for Naive Bayes base models, AveBoost2 performs much better than AdaBoost
overall, especially on the noisy datasets.

We compare AdaBoost and AveBoost2 using decision tree base models in
Figure 5 (original datasets) and Figure 6 (noisy datasets). On the original datasets,
the performances of the two algorithms are comparable. However, on the noisy
datasets, AveBoost2 is superior for all except the Balance dataset. On the
Balance dataset, AdaBoost actually performed as much as 10% better on the noisy
data than the original data, which is strange and needs to be investigated
further. AveBoost2 performed worse on the noisy Balance data than on the original
Balance data. Figure 7 gives the error rate comparison between AdaBoost and
AveBoost2 with decision stump base models on the original datasets. Figure 8
gives the same comparison on the noisy datasets. With decision stumps, the
two algorithms always seem to perform comparably. We suspect that decision
stumps are too stable to allow the different distribution calculation methods of
AdaBoost and AveBoost2 to yield significant differences.

⁴ We use a t-test with $\alpha = 0.05$ to compare all the classifiers in this paper.

Fig. 5. Test set error rates of AdaBoost vs. AveBoost2 (Decision Trees, original datasets)

Fig. 6. Test set error rates of AdaBoost vs. AveBoost2 (Decision Trees, noisy datasets)

Fig. 7. Test set error rates of AdaBoost vs. AveBoost2 (Decision Stumps, original datasets)

Fig. 8. Test set error rates of AdaBoost vs. AveBoost2 (Decision Stumps, noisy datasets)


6 Conclusions 

We presented AveBoost2, a boosting algorithm that trains each base model
using a training example weight vector that is based on the performances of all
the previous base models rather than just the previous one. We discussed our
theoretical results and demonstrated empirical results that are superior overall
to AdaBoost, especially on datasets with label noise.

Our theoretical and empirical results do not account for what happens as 
the amount of noise changes. We plan to derive such results. In a longer version 
of this paper, we plan to perform a more detailed empirical analysis including 
the performances of the base models and ensembles on the training and test 
sets, correlations among the base models, ranges of the weights of regular and 
noisy examples, etc. In [8], we performed such an analysis to a limited extent for 
the original AveBoost and were able to confirm some of what we hypothesized 
there and in this paper: the base model accuracies tend to be higher than for 
AdaBoost, the correlations among the base models also tend to be higher, and 
the ranges of the weights of the training examples tend to be lower. We were
unable to repeat this analysis here due to a lack of space. Such an analysis 
not only will enable us to understand empirically what is occurring but should 
guide our theoretical analysis of the performances of AdaBoost and AveBoost2 
on noisy data. 